Pretrained language models possess broad knowledge acquired from vast amounts of unlabeled data, but they are trained only to predict the next token, which makes them poorly suited out of the box for instruction-following tasks such as question answering. In addition, their general pretraining knowledge may not be sufficient for specific tasks or domains whose data is not publicly available. These scenarios are where Supervised Fine-Tuning (SFT) comes into play. SFT extends the model's capabilities to perform targeted tasks more accurately by letting it learn task-specific patterns and nuances not present in the general pretraining data.
There are three main types of Supervised Fine-Tuning (SFT) for large language models:
Full Model Fine-Tuning. This approach updates all parameters of the pre-trained model. It offers maximum flexibility in adapting the model to specialized tasks and often yields significant performance improvements, but it requires substantial computational resources.
Feature-Based Fine-Tuning. This method extracts features from the pre-trained model and uses them as input for another model or classifier; the pre-trained model itself remains unchanged. It is less resource-intensive and delivers results faster, making it suitable when computational power is limited.
Parameter-Efficient Fine-Tuning (PEFT). PEFT techniques fine-tune models more efficiently by modifying only a small portion of the model's weights, leaving the fundamental language understanding intact. They typically add task-specific layers or adapters to the pre-trained model, significantly reducing computational costs compared to full fine-tuning while still achieving competitive performance.
The choice between these approaches depends on the specific requirements of the task, the available computational resources, and the desired model performance.
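To make the "parameter-efficient" point concrete, here is a rough back-of-the-envelope comparison. The matrix size and the rank are illustrative choices, not values from any particular model:

```python
# Illustrative parameter count for one 4096 x 4096 weight matrix.
d = 4096
full = d * d          # parameters updated by full fine-tuning
r = 8                 # adapter rank (an illustrative choice)
adapter = 2 * d * r   # a LoRA-style adapter trains two low-rank matrices (d x r and r x d)

print(full)            # 16777216
print(adapter)         # 65536
print(adapter / full)  # 0.00390625 -> only ~0.4% of the parameters are trained
```

Even for a single layer, the adapter trains fewer than half a percent of the weights, which is where the efficiency of PEFT comes from.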
In this article, we will discuss the two most popular and effective PEFT techniques: LoRA and QLoRA.
Supervised Fine Tuning with LoRA and QLoRA
LoRA (Low-Rank Adaptation) was introduced in 2021 in the paper “LoRA: Low-Rank Adaptation of Large Language Models” by Edward Hu et al. and has since gained widespread adoption. It is a cost-effective and efficient method for adapting pretrained language models to specific tasks: most of the model's parameters are frozen, and only a small number of task-specific low-rank weights are updated. These lightweight adapters reduce the training overhead, making LoRA an attractive solution for limited-compute scenarios.
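The mechanics can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not the paper's implementation; the rank r, the alpha scaling, and the zero/small initialization mirror common LoRA conventions:

```python
import numpy as np

# Minimal LoRA sketch: the frozen weight W is augmented with a trainable
# low-rank update B @ A, scaled by alpha / r, so the effective weight
# becomes W + (alpha / r) * B @ A.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 4, 8

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable, small random init
B = np.zeros((d_out, r))                # trainable, zero init

def lora_forward(x):
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(d_in,))
# Because B starts at zero, the adapted layer initially reproduces
# the frozen layer exactly; training then only updates A and B.
assert np.allclose(lora_forward(x), W @ x)
```

The zero initialization of B is the detail that makes fine-tuning start from the pretrained model's behavior rather than perturbing it.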
QLoRA (Quantized Low-Rank Adaptation) is an extension of the LoRA technique, proposed in the paper “QLoRA: Efficient Finetuning of Quantized LLMs” by Tim Dettmers et al. in 2023. It quantizes the frozen pretrained weights to 4 bits (down from the usual 16- or 32-bit precision) while training LoRA adapters on top. This results in significant memory savings and enables fine-tuning large language models on a single GPU.
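To see what 4-bit quantization means in practice, here is a toy absmax quantizer. This is a conceptual sketch only; QLoRA actually uses the NF4 data type with double quantization, which is more sophisticated than this:

```python
import numpy as np

# Toy 4-bit absmax quantization: weights are mapped to 16 signed integer
# levels and dequantized on the fly during the forward pass.
def quantize_4bit(w):
    scale = np.abs(w).max() / 7.0                # map the range to levels -7..7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.21, -0.53, 0.07, 0.9, -0.88], dtype=np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)
# Each stored weight now needs 4 bits plus a shared scale, at the cost of
# a small reconstruction error bounded by half a quantization step.
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

Storing 4-bit integers plus one scale per block is what shrinks a model's weight memory by roughly 4x compared to 16-bit storage.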
How to choose between LoRA and QLoRA
LoRA generally requires more GPU memory than QLoRA but is more efficient than full fine-tuning, making it suitable for systems with moderate to high GPU memory capacity. QLoRA, on the other hand, significantly lowers memory demands, making it more suitable for devices with limited memory resources. While LoRA is often faster, QLoRA incurs slight speed trade-offs due to quantization steps but offers superior memory efficiency, enabling fine-tuning of larger models on constrained hardware.
Accuracy and computational efficiency also differ between the two methods. LoRA typically yields stable and precise results, whereas QLoRA’s use of quantization may lead to minor accuracy losses, though it can sometimes reduce overfitting. When it comes to specific needs, LoRA is ideal if preserving full model precision is vital, whereas QLoRA shines for extremely large models or environments with tight memory constraints. QLoRA also supports varying levels of quantization (e.g., 8-bit, 4-bit, or even 2-bit), adding flexibility but at the cost of increased implementation complexity.
When deciding between LoRA and QLoRA for fine-tuning large language models, key considerations revolve around hardware, model size, speed, and accuracy needs.
Use-case
In this article, we illustrate a specific use case: supervised fine-tuning of the Qwen2.5-3B model using LoRA and QLoRA to build a story generator for children.
For the supervised aspect, we use an instruction dataset, TinyStories_Instruction, which contains instruction-story pairs. I prepared this dataset in the previous post; if you have not read it yet, I recommend checking it out. The stories are short and synthetically generated by GPT-3.5 and GPT-4 with a limited vocabulary, making them highly suitable for our intended 5-year-old readers, while the instructions were also created synthetically using GPT-4o-mini.
Qwen2.5-3B is a pretrained language model with 3.09 billion parameters, making it powerful yet small enough to fine-tune even on resource-constrained platforms like Google Colab. The goal of this guide is to explore the fine-tuning process for Qwen2.5-3B using LoRA (Low-Rank Adaptation), specifically leveraging the Unsloth framework. While the actual training run will be covered in another post, here we introduce the key concepts and the detailed implementation necessary to achieve this.
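A quick back-of-the-envelope calculation shows why quantization matters at this scale. The figures below cover the base weights only and ignore activations, gradients, optimizer state, and the LoRA adapters themselves:

```python
# Rough memory footprint of the 3.09B base weights alone (illustrative arithmetic).
params = 3.09e9
gb_16bit = params * 2 / 1e9    # 2 bytes per weight in fp16/bf16 -> ~6.2 GB
gb_4bit = params * 0.5 / 1e9   # 0.5 bytes per weight when 4-bit quantized -> ~1.5 GB
print(gb_16bit, gb_4bit)
```

On a free Colab T4 with around 15 GB of VRAM, that difference is what decides whether the model, its activations, and the optimizer state fit at all.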
Implementation
To achieve the fine-tuning, we will utilize the following libraries and methods:
Step 1: Import Necessary Libraries
import os
import comet_ml
import torch
from trl import SFTTrainer
from datasets import load_dataset, concatenate_datasets
from transformers import TrainingArguments, TextStreamer
from unsloth import FastLanguageModel, is_bfloat16_supported
from google.colab import userdata
Step 2: Comet ML Login
comet_ml.login(project_name="sft-lora-unsloth")
Step 3: Load Pretrained Model and Tokenizer
max_seq_length = 2048
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="Qwen/Qwen2.5-3B",
max_seq_length=max_seq_length,
load_in_4bit=False,  # set to True to fine-tune with QLoRA instead of LoRA
)
Step 4: Apply LoRA Adaptation
model = FastLanguageModel.get_peft_model(
model,
r=32,
lora_alpha=32,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "up_proj", "down_proj", "o_proj", "gate_proj"],
)
Step 5: Formatting Dataset
Prepare the dataset using a specific text template and map it accordingly.
alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}"""
EOS_TOKEN = tokenizer.eos_token
def format_samples(examples):
text = []
for instruction, output in zip(examples["instruction"], examples["output"], strict=False):
message = alpaca_template.format(instruction, output) + EOS_TOKEN
text.append(message)
return {"text": text}
# Load the instruction dataset prepared in the previous post
# (repo id assumed here; adjust it to your own copy)
dataset = load_dataset("tanquangduong/TinyStories_Instruction", split="train")
dataset = dataset.map(format_samples, batched=True, remove_columns=dataset.column_names)
dataset = dataset.train_test_split(test_size=0.05)  # creates the "train" and "test" splits used below
Step 6: Setting Up the Trainer
Utilize the SFTTrainer for supervised fine-tuning.
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
dataset_text_field="text",
max_seq_length=max_seq_length,
dataset_num_proc=2,
packing=True,
args=TrainingArguments(
learning_rate=1e-5,
lr_scheduler_type="linear",
per_device_train_batch_size=2,
gradient_accumulation_steps=8,
num_train_epochs=3,
fp16=not is_bfloat16_supported(),
bf16=is_bfloat16_supported(),
logging_steps=1,
optim="adamw_8bit",
weight_decay=0.01,
warmup_steps=10,
output_dir="output",
report_to="comet_ml",
seed=0,
),
)
trainer.train()
Step 7: Model Inference
Generate a response using the fine-tuned model.
FastLanguageModel.for_inference(model)
message = alpaca_template.format("Write a story about a humble little bunny \
named Ben who follows a mysterious trail in the woods, \
discovering beautiful flowers, new friends, and a lovely pond along the way.", "")
inputs = tokenizer([message], return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=256, use_cache=True)
Step 8: Save and Push to Hugging Face Hub
from huggingface_hub import login
# Log in to the Hugging Face Hub
login(token=userdata.get('HF_TOKEN'))
model.save_pretrained_merged("model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("tanquangduong/Qwen2.5-3B-Instruct-TinyStories", tokenizer, save_method="merged_16bit")
Inference
Using the fine-tuned model for inference:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
tokenizer = AutoTokenizer.from_pretrained("tanquangduong/Qwen2.5-3B-Instruct-TinyStories")
model = AutoModelForCausalLM.from_pretrained("tanquangduong/Qwen2.5-3B-Instruct-TinyStories")
model = model.to("cuda")
alpaca_template = """Below is an instruction that describes a task.
Write a response that appropriately completes the request.
### Instruction:
{}
### Response:
{}"""
model.eval()
message = alpaca_template.format("Write a story about a humble little bunny named Ben who follows a mysterious trail in the woods, discovering beautiful flowers, new friends, and a lovely pond along the way.", "")
inputs = tokenizer([message], return_tensors="pt").to("cuda")
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=text_streamer, max_new_tokens=2048, use_cache=True)
Conclusion
This guide walked through the supervised fine-tuning process of Qwen2.5-3B using the Unsloth framework and LoRA adapters. Fine-tuning such models with cost-effective methods like LoRA makes the process feasible on smaller setups, such as those using Colab. The end result is a model that can generate customized responses tailored to specific use cases, such as creating TinyStories-style tales for children. This approach highlights the flexibility and power of modern transformer-based architectures in domain-specific tasks.